Can supervised deep learning architecture outperform autoencoders in building propensity score models for matching?

Purpose Propensity score matching is vital in epidemiological studies using observational data, yet its estimates relies on correct model-specification. This study assesses supervised deep learning models and unsupervised autoencoders for propensity score estimation, comparing them with traditional methods for bias and variance accuracy in treatment effect estimations. Methods Utilizing a plasmode simulation based on the Right Heart Catheterization dataset, under a variety of settings, we evaluated (1) a supervised deep learning architecture and (2) an unsupervised autoencoder, alongside two traditional methods: logistic regression and a spline-based method in estimating propensity scores for matching. Performance metrics included bias, standard errors, and coverage probability. The analysis was also extended to real-world data, with estimates compared to those obtained via a double robust approach. Results The analysis revealed that supervised deep learning models outperformed unsupervised autoencoders in variance estimation while maintaining comparable levels of bias. These results were supported by analyses of real-world data, where the supervised model’s estimates closely matched those derived from conventional methods. Additionally, deep learning models performed well compared to traditional methods in settings where exposure was rare. Conclusion Supervised deep learning models hold promise in refining propensity score estimations in epidemiological research, offering nuanced confounder adjustment, especially in complex datasets. We endorse integrating supervised deep learning into epidemiological research and share reproducible codes for widespread use and methodological transparency. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-024-02284-5.


B Outcome Model-specification in the Plasmode Simulation
The data generating model for the plasmode simulation includes • main effects of all covariates stated above, • polynomial terms of the age variable up to the third order (age + age^2 + age^3) • the PaO2/FIO2 ratio up to the second order (pafi1 + pafi1^2), • a second-order interaction term between heart rate and mean blood pressure (hrt1 * meanbp1), and • a third-order interaction among Glasgow Coma Score, Hematocrit, and Sodium (scoma1:hema1:sod1), • the exponentiation of weight (exp(wtkilo1)) and • the cosine of the APACHE score (cos(aps1)).

C Model Architecture for the Autoencoders
The autoencoder was designed as a sequential model with a mirrored encoder-decoder architecture.The encoder consisted of two dense layers with 100 and 50 neurons, each using ReLU activation.The bottleneck layer, crucial for feature compression, had 30 neurons with ReLU activation.The decoder mirrored the encoder, expanding the compressed features back to the original dimension.The output layer used sigmoid activation with neurons equal to the number of input features.
The dataset was divided into training (80%) and testing (20%) subsets for model training and internal validation.The model was compiled using mean squared error as the loss function and the Adam optimizer with an initial learning rate.Key callbacks included early stopping to prevent overfitting, learning rate scheduling for adaptive learning rate adjustments, model checkpointing to save the best model, and logging of training metrics.
The autoencoder was trained over 150 epochs with a batch size of 32.The model's performance was evaluated on the test set, assessing reconstruction accuracy.Post-training, the learning rate was updated for potential subsequent training phases.Feature extraction was conducted by isolating the last layer's output, providing a compressed representation of the input data.These features were then utilized in propensity score model building.C Comparing with previously suggested autoencoders: The provided autoencoder design by Weberpals et al. (2021) focuses on reconstructing input data through a series of dense encoding and decoding layers, without incorporating additional design features such as dropout, regularization, normalization, and kernel initialization.Similarly, our original autoencoder model also lacks these design features, relying on a straightforward dense layer architecture for encoding and decoding.

D Model Architecture for the Supervised Deep Learning Model
The architecture of supervised model comprised multiple dense layers with 100, 50, 30, and 20 neurons, respectively.These layers utilized ReLU activation and He normal kernel initialization to promote effective learning.To prevent overfitting and enhance the model's generalization capability, we incorporated L2 regularization in the dense layers, specifically in the first layer with a regularization factor of 0.01.This regularization technique penalizes the weights' magnitude, encouraging smaller, more robust weight values.
Furthermore, to mitigate overfitting, dropout layers with a rate of 0.3 were interspersed between the dense layers.Batch normalization was also employed to stabilize the learning process by normalizing the inputs to each layer.The final layer, aimed at binary classification, consisted of a single neuron with sigmoid activation.
The model was compiled with binary crossentropy loss, Adam optimizer, and accuracy as the performance metric.To prevent overfitting and enhance model generalization, we incorporated several key components: dropout layers with a dropout rate of 0.3, L2 regularization with a penalty coefficient of 0.01, batch normalization layers to stabilize and accelerate training, and the He normal initializer for kernel initialization, which is particularly suited for layers with ReLU activations.The hyperparameters, such as learning rate and batch size, were fine-tuned using a validation dataset derived from the simulation data.
The number of layers and units per layer were chosen based on preliminary experiments and cross-validation to balance complexity and performance.For instance, the architecture typically included layers with specified units, which provided a good trade-off between model capacity and overfitting risk.E.1).Specifically, we added dropout layers with a dropout rate of 0.3 after each dense layer to prevent overfitting by randomly dropping units during training.We incorporated L2 regularization with a penalty coefficient of 0.01 in the dense layers to penalize large weights and improve generalization.Additionally, we applied batch normalization layers after selected dense layers to normalize the activations and accelerate training convergence.For kernel initialization, we used the He normal initializer for the dense layers to set the initial weights.These modifications ensure that our autoencoder benefits from the same regularization techniques as the deep learning model, providing a robust and fair comparison between the two methodologies.

Appendix Table
Our original deep learning architecture is described in Appendix

E.2 Rare Exposure and Frequent Outcome
Appendix Table E.2: Performance measures of the 4 different propensity score matching methods from the plasmode simulation based on the Right Heart Catheterization (RHC) study in the presence of a rare exposure (prevalence 5%) and a frequent outcome (prevalence 30%).The results (point Estimate of performance measures, and Monte Carlo Standard Errors) are derived from 1, 000 sets of plasmode simulation data, each with a sample size of 3, 500.The true target parameter was set to an odds ratio of 0.7.

E.6 Comparing Standard Errors
Appendix

F
Step-by-step Software Guide for Reproducing the Analysis Results

Data Preparation and Summary
Here we show reproducible codes of how one can conduct the analyses described in the manuscript.For the following analysis, we used the following R packages: caret , Hmisc , factoextra , gt , gtsummary , smd , ggplot2 , dplyr , MatchIt , cobalt , jtools , earth , keras , tensorflow , checkmate , SuperLearner and tmle .

Data Download
The right heart catheterization (RHC) dataset is obtained from hbiostat.org/data(http://hbiostat.org/data),courtesy of the Vanderbilt University Department of Biostatistics.
# Set a seed for reproducibility of random operations set.seed(123)# Read the RHC dataset from the URL and store it within R rhc <-read.csv("https://hbiostat.org/data/repo/rhc.csv")

Data Preparation
We converted the treatment variable to a binary indicator to represent the administration of RHC, with 'RHC' coded as 1 and all other values as 0.
Similarly, the death variable was transformed into a binary outcome variable indicating 30-day mortality, with 'Yes' responses coded as 1.The categorical variables (race, sex, primary disease category, cancer status) were then recoded to appropriate categories.Adjustment variables of interest included demographic, clinical, and comorbidity information.

Analysis
Steps In all of the following analyses, 4 steps were followed to conduct the propensity core analysis (see Appendix Table 2).

Appendix Table 2:
Steps of propensity score analysis.
Step Description Step 1 Exposure (RHC) modelling based on identified characteristics/covariates, or some function of them.This modelling can be done using various statistical and machine learning methods to estimate the probability (propensity score) that a patient would receive RHC based on their characteristics.The predicted probabilities from this regressions are the propensity scores.
Step 2 After estimating propensity scores, patients who received RHC are matched with those who did not, based on their propensity scores.We used the nearest neighbor matching method at a 1:1 ratio.A caliper ensures that matched pairs are similar in their probabilities.
Step 3 The balance of covariates in the matched dataset is assessed to ensure that the matching process has successfully created comparable treated and untreated groups.Standardized mean differences (SMD) are commonly used to measure such balance.SMD values below 0.25 was used as an indication of good balance.
Step 4 The final step involves modelling the outcome as a function of the treatment (RHC), while adjusting for covariates.This double adjustment approach, where covariates are adjusted in both the propensity score model and the outcome model, helps to account for any residual confounding.
Details about step 1 (exposure modelling) We will be using 4 different approaches to estimate the propensity scores in step 1: 1. Logistic regression: A traditional statistical method where the log odds of receiving treatment/being exposed (RHC) are modelled as a linear combination of covariates.2. Multivariate Adaptive Regression Splines (MARS): A non-parametric regression technique that can model complex, non-linear relationships between the covariates and the treatment assignment.3. Autoencoders: A type of neural network used for unsupervised learning of efficient codings, which can be used to reduce the dimensionality of covariates and capture non-linear interactions before estimating propensity scores.4. Deep learning: Utilizing deep neural networks to model the treatment assignment process, capable of capturing complex, non-linear relationships between the covariates and the treatment.

Remaining steps
Rest of the steps (2-4) remain the same after estimating the propensity scores: Targeted Maximum Likelihood Estimation We have also added a Targeted Maximum Likelihood Estimation (TMLE) step, so that we can compare our matching estimates with estimates from a double robust approach.
Figure E.2: Plots comparing empirical and model-based standard errors (point Estimates and Monte Carlo Standard Errors of corresponding performance measures) from four propensity score estimation methods in the plasmode simulation under different scenarios: Logistic Regression (PS), Autoencoders (AE), Deep Learning (DL), and Multivariate Adaptive Regression Splines (MARS).The results are derived from 1, 000 sets of plasmode simulation data.
Deep Learning Model Architecture and Hyperparameter Selection: In our study, the architecture of the supervised Deep Learning model was carefully designed based on established best practices and empirical testing.The architecture includes multiple hidden layers with ReLU activation functions, which are known for their efficiency in training deep networks by mitigating the vanishing gradient problem.

.1 Frequent Exposure and Frequent Outcome Our
original autoencoder architecture is described in Appendix TableC.1.To enhance the comparability between our deep learning model ('DL') and the autoencoder, we introduced several regularization techniques and hyperparameters to the original autoencoder architecture (and named the new model as 'AE.o'; see Appendix Table D.1: Model Architecture for the Supervised Deep Learning Model TableD.1.In the analysis, we added a new deep learning model by removing several features to simplify the architecture and ensure a more straightforward comparison with the autoencoder ('AE').Specifically, we eliminated dropout layers, which are typically used to prevent overfitting by randomly dropping units during training.We also removed regularization terms, such as L2 regularization, which help to penalize large weights and reduce model complexity.Additionally, we excluded batch normalization layers, which are often used to normalize the input of each layer to improve training stability and speed.Finally, we opted not to use specific kernel initialization methods for the dense layers, which are generally employed to set the initial weights of the model.These changes result in a more basic neural network structure, allowing us to focus on the core performance of the model without the influence of these additional regularization and optimization techniques.We named this model as 'DL.n' (see Appendix TableE.1).

4 Large Sample size Appendix Table E.4:
Performance measures of the 4 different propensity score matching methods from the plasmode simulation based on the Right Heart Catheterization (RHC) study in the presence of a frequent exposure (prevalence 30%) and a frequent outcome (prevalence 30%).The results (point Estimate of performance measures, and Monte Carlo Standard Errors) are derived from 1, 000 sets of plasmode simulation data, each with a sample size of 5, 000.The true target parameter was set to an odds ratio of 0.7.PS = propensity score based on logistic regression; AE = propensity scores based on Autoencoders; DL = propensity scores based on supervised deep learning; MARS = Propensity scores based on Multivariate Adaptive Regression Splines.SE = Standard Error; MSE = Mean Squared Error; Coverage = Coverage for nominal 95% Confidence Interval.Arrows indicate the direction of the change in performance measures relative to the PS method: ↑ indicates an increase, and ↓ indicates a decrease.
PS = propensity score based on logistic regression; AE = propensity scores based on Autoencoders; DL = propensity scores based on supervised deep learning; MARS = Propensity scores based on Multivariate Adaptive Regression Splines.SE = Standard Error; MSE = Mean Squared Error; Coverage = Coverage for nominal 95% Confidence Interval.Arrows indicate the direction of the change in performance measures relative to the PS method: ↑ indicates an increase, and ↓ indicates a decrease.E.3 Frequent Exposure and Rare OutcomeAppendix TableE.3:Performancemeasures of the 4 different propensity score matching methods from the plasmode simulation based on the Right Heart Catheterization (RHC) study in the presence of a frequent exposure (prevalence 30%) and a rare outcome (prevalence 5%).The results (point Estimate of performance measures, and Monte Carlo Standard Errors) are derived from 1, 000 sets of plasmode simulation data, each with a sample size of 3, 500.The true target parameter was set to an odds ratio of 0.7.PS = propensity score based on logistic regression; AE = propensity scores based on Autoencoders; DL = propensity scores based on supervised deep learning; MARS = Propensity scores based on Multivariate Adaptive Regression Splines.SE = Standard Error; MSE = Mean Squared Error; Coverage = Coverage for nominal 95% Confidence Interval.Arrows indicate the direction of the change in performance measures relative to the PS method: ↑ indicates an increase, and ↓ indicates a decrease.E.SE = Standard Error; MSE = Mean Squared Error; Coverage = Coverage for nominal 95% Confidence Interval.Arrows indicate the direction of the change in performance measures relative to the PS method: ↑ indicates an increase, and ↓ indicates a decrease.E.5 Null EffectAppendix TableE.5:Performance measures of the 4 different propensity score matching methods from the plasmode simulation based on the Right Heart Catheterization (RHC) study in the presence of a frequent exposure (prevalence 30%) and a frequent outcome (prevalence 30%).The results (point Estimate of performance measures, and Monte Carlo Standard Errors) are derived from 1, 000 sets of plasmode simulation data, each with a sample size of 3, 500.The true target parameter was set to an odds ratio of 1.

Table 1 :
Summary of the variables stratified by the Right Heart Catheterization (RHC) Status, and the associated standardized mean differences.
We prepare the summary table based on the prepared data.The table is stratified by RHC status.We also reported the standardized mean differences for each covariate.

Table 1 :
Summary of the variables stratified by the Right Heart Catheterization (RHC) Status, and the associated standardized mean differences.Dummy variables were generated for categorical features.Continuous variables were preprocessed using centering and scaling to normalize their distributions.The preprocessed data, consisting of binary and normalized continuous variables, was combined into a single dataset.#Separatecontinuous and binary variables based on the above identificationsel.continuous.col<-names(sel.col[sel.col== FALSE]) sel.binary.col<-names(sel.col[sel.col== TRUE]) # Subset the data frame to include only binary variables rhc_prep1 <-rhc_prep[sel.binary.col] # Create dummy variables for categorical features in 'rhc2' for regression analysis dmy <-dummyVars(" ~ .",data = rhc2, fullRank = TRUE) # Apply the dummy variable transformation to 'rhc2' and store the result in a new data frame 'rhc_prep' rhc_prep <-data.frame(predict(dmy,newdata = rhc2))

Table 3 :
Summary of effect of RHC on death outcome based on two approaches (propensity score matching and Targeted Maximum Likelihood Estimation [TMLE]), when the propensity scores were estimated via a logistic regression.The outcome regression from TMLE was fitted via a super learner (with logistic regression and MARS as candidate learners) with 5 fold cross-validation.Exposure modeling # Fit a MARS to estimate propensity scores based on exposure and covariates PS.fit <-earth(ps.formula,data=rhc_prep3) # Step 1:

Table 4 :
Summary of effect of RHC on death outcome based on two approaches (propensity score matching and Targeted Maximum Likelihood Estimation [TMLE]), when the propensity scores were estimated via a Multivariate Adaptive Regression Splines (MARS) regression.The outcome regression from TMLE was fitted via a super learner (with logistic regression and MARS as candidate learners) with 5 fold crossvalidation.